\begin {table}[t]
 \centering
  \singlespacing
 \caption{{Application characteristics}: {\it Read\%}:Denotes the percentage of reads to the L2 cache out of the total L2 accesses,
 {\it Write\%}:Denotes the percentage of writes to the L2 cache out of the total L2 accesses, {\it Intensity}: Read/Write intensive based on read\%/write\%.}
 \vspace{0.2in}
 \label{table:benchmark}
 \scriptsize
\begin{tabular}{||c|c|c|c|c||c|c|c|c|c||}
\hline
\# & PARSEC 2.1 & Read\% & Write\% & Intensity & \# & SPEC 2K6 & Read\% & Write\% & Intensity \\
\hline \hline
1 & blackscholes  &  91.9 &  8.1 &   Read &  13 &  bzip2 & 86.2 & 13.8 & Read \\
\hline
2 &     bodytrack  &     92.2 &  7.8 &   Read &  14 & gcc &  99.4 &  0.6 & Read \\
\hline
3 &     dedup  & 73.8 &  26.2 &  Write & 15 &    mcf & 94.5 &  5.5 & Read \\
\hline
4 &     facesim (fcsim.) & 78.7 &  21.3 &  Read &  16 & leslie3d & 70.7 & 29.3 & Write \\
\hline
5 &     ferret (frrt.) & 46.2 &  53.8 &  Write & 17 & namd &  92.7 &  7.3 & Read \\
\hline
6 &     fluidanimate (fluid.) &  82.4 &  17.6 &  Read &  18 & soplex & 59.6 & 40.4 & Write \\
\hline
7 &     freqmine (freq.) & 72.1 & 27.9 & Write & 19 & hmmer &   63.6 &  36.4 &  Write \\
\hline
8 &     rtview (rtvw.) & 64.1 & 35.9 &  Write & 20 & sjeng & 76.6 &  23.4 &  Write \\
\hline
9 &     streamcluster & 98.4 &  1.6 &   Read &  21 & libquantum(libq.) & 100.0 & 0.0 & Read \\
\hline
10 &    swaptions (swpts.) & 49.9 & 50.1 & Write & 22 & lbm & 15.7 & 84.3 & Write \\
\hline
11 &    vips &  75.0 &  25.0 &  Write &  23 & GemsFDTD & 99.2 & 0.8 & Read \\
\hline
12 &    x264 &  95.5 &  4.5 & Read &  24 & omnetpp & 97.7 & 2.3 & Read \\
\hline
& & & & & 25 &  h264ref & 57.8 &  42.2 &  Write \\
\hline 
\end{tabular}
\end{table}



\begin {table*} [t]
 \scriptsize
  \centering
 \caption {{Baseline processor, cache, memory and network configuration}} \label{table:sim_config}
 \begin{tabular}{|l|l|}
 \hline
Processor Pipeline & 2 GHz processor, 64-entry instruction window, Fetch/Exec/Commit width 8 \\
\hline
L1 Cache (SRAM) & 32 KB per-core (private) I/D cache, 4-way set associative, 64B block size, write-back, 10 MSHRs \\
\hline
L2 Cache (SRAM or STT-RAM) &  1MB (SRAM) or 4MB (STT-RAM) bank, shared, 16-way set associative, 64B block size, 10 MSHRs \\
\hline
Network & Ring network, one router per bank, 3 cycle router and 1 cycle link latency \\
\hline
Main Memory & 4GB DRAM, 400 cycle access \\
\hline
\end{tabular}
\end{table*}

%\noindent\textbf{Experimental Setup:}
We evaluate our designs using a modified ALPHA M5 Simulator \cite{M5} .
We operate the M5 Simulator in Full System (FS) mode for PARSEC 2.1 applications and in the System Emulation (SE) Mode
for the SPEC 2K6 multiprogrammed mixes. We model a 2GHz processor with four out-of order cores.
The memory instructions are modeled through M5 detailed memory hierarchy. We modified the M5 simulator to model
L2 cache banks composed of tunable retention time STT-RAM cells. Table~\ref{table:sim_config} details our experimental system configuration.

%\noindent\textbf{Collection of Results:}
We report results of 12 multithreaded PARSEC 2.1 applications and 14 multiprogrammed mixes of 4 SPEC 2K6 applications.
Multiprogrammed mixes are chosen randomly from the set of 13 SPEC 2K6 applications.
Table ~\ref{table:benchmark} shows the properties of PARSEC 2.1 and SPEC 2K6 applications.
We use sim-small input for PARSEC 2.1 applications and report the results of only Region of Interest,(ROI)(except for facesim, where we report results for only 2B instructions of ROI)
after warming up the caches for 500M instructions and skipping the initialization and termination phases. For the SPEC multiprogrammed mixes,
we fast forward 1B Instructions, warm up caches for 500M instructions and then report
results for 1B instructions.

\noindent\textbf{Design Choices:}
We report the results for the following L2 cache configurations:

%\squishlist

$\bullet$ {\bf S-1MB:} This is our baseline scheme, where all L2 cache banks are
composed of SRAM cells. Capacity of each bank is 1MB.

$\bullet$ {\bf S-4MB:} This is an ideal case, where all L2 cache banks
are composed of SRAM cells. Capacity of each bank is 4MB and each bank
has the same read and write latency as that of S-1MB. This case is analyzed to see the potential benefit
of having a 4x improvement in cache capacity at 4x area density, while still having read/write latencies comparable to SRAM.
This ideal but hypothetical design has the capacity, area and leakage properties of a STT-RAM but the read/write latencies of an SRAM cache.

$\bullet$ {\bf M-4MB:} This is our baseline scheme for STT-RAM design, where
all L2 cache banks are composed of 10 year retention time STT-RAM cells.
Capacity of each bank is 4MB and each bank occupies the same area as that of an SRAM bank.

$\bullet$ {\bf Volatile M-4MB(1sec):} This design is used to evaluate our
 Volatile STT-RAM Scheme described in Section~\ref{sec:implementation}, where all L2 cache
banks are composed of 1 sec retention time STT-RAM cells.

$\bullet$ {\bf Volatile M-4MB(10ms):} This design is similar to Volatile M-4MB(1sec)
except that the retention time of STT-RAM cells is 10 ms.

$\bullet$ {\bf Revived M-4MB(10ms):} This design is used to evaluate our
 Revived STT-RAM Scheme described in Section ~\ref{sec:implementation}, where all L2 cache
banks are composed of 10 ms retention time STT-RAM cells. All the
results are for the design with 8 MRU Slots and 1900 Buffer Slots.

All 4MB STT-RAM banks occupy the same area as that of a 1MB SRAM bank (because STT-RAM cells are 4x denser), but have varying write latencies due to varying retention times.
%\squishend

\noindent\textbf{ Evaluation Metrics:}
For the multithreaded PARSEC 2.1 applications, we assume 4 threads are mapped to our modeled
processor with four cores. We report normalized speedup for these applications,
which is defined as the improvement over the slowest thread.
For the multiprogrammed SPEC applications, we report Instruction Throughput and Weighted Speedup.
We define instruction throughput (IT) to be the sum of all the number of instructions committed
per cycle in the entire CMP (Eq. (5)).  The weighted speedup (WS) is defined as the slowdown experienced
by each application in a multiprogram mix, compared to its run under the same
configuration when no other application is running on the other cores(Eq.(6)). For analyzing the energy behavior, we measure the leakage energy, dynamic energy and total energy for all designs.

{
%\scriptsize{
 $Instruction$ $throughput$ $=$ $\displaystyle\sum_{i} IPC_{i}$ \hspace{1mm} \textbf{(5)} \hspace{4mm}
 $Weighted$ $speedup$ $=$ $\displaystyle\sum_{i}
\frac{IPC_{i}^{shared}}{IPC_{i}^{alone}}$ \hspace{1mm} \textbf{(6)}
}




